---
title: "Introduction to Lexical Similarity"
author: "Dattatreya Majumdar"
date: "2025"
params:
title: "Introduction to Lexical Similarity"
author: "Dattatreya Majumdar"
year: "2025"
version: "2025.04.02"
url: "https://ladal.edu.au/tutorials/lexsim/lexsim.html"
institution: "The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia"
description: "This tutorial introduces lexical similarity analysis in R, covering string distance metrics, edit distance, and methods for comparing and clustering documents and words based on their surface forms. It is aimed at researchers in corpus linguistics, historical linguistics, and computational linguistics who need to quantify similarity between texts or lexical items."
doi: "10.5281/zenodo.19332903"
format:
html:
toc: true
toc-depth: 4
code-fold: show
code-tools: true
theme: cosmo
---
{ width=100% }
# Introduction{-}
This tutorial introduces Text Similarity [see @zahrotun2016comparison; @li2013distance], i.e. how close or similar two pieces of text are with respect to either their use of words or characters (lexical similarity) or in terms of meaning (semantic similarity). The tutorial is aimed at beginners and intermediate users of R and showcases how to assess the similarity of texts in R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods for assessing text similarity.
*Lexical Similarity* provides a measure of the similarity of two texts based on the intersection of their word sets (of the same or different languages). A lexical similarity of 1 suggests that there is complete overlap between the vocabularies, while a score of 0 suggests that the two texts share no words. There are several different ways of evaluating lexical similarity, such as Jaccard Similarity, Cosine Similarity, and Levenshtein Distance.
*Semantic Similarity*, on the other hand, measures the similarity between two texts based on their meaning rather than their surface form. Semantic similarity is highly useful for summarising texts and extracting key attributes from large documents or document collections. It can be evaluated using methods such as *Latent Semantic Analysis* (LSA), *Normalised Google Distance* (NGD), and *Salient Semantic Analysis* (SSA).
In this tutorial we focus primarily on Lexical Similarity. We begin with a brief overview of the relevant concepts and then show how the different measures can be implemented in R.
## Jaccard Similarity{-}
The Jaccard similarity of two texts is defined as the size of the intersection of their word sets divided by the size of their union. In other words, it is the number of shared words over the total number of distinct words in the two texts or documents. The Jaccard similarity of two documents ranges from 0 to 1, where 0 signifies no overlap and 1 signifies complete overlap. The mathematical representation of the Jaccard Similarity is shown below:
\begin{equation}
J(A,B) = \frac{|A \bigcap B|}{|A \bigcup B |} = \frac{|A \bigcap B|}{|A| + |B| - |A \bigcap B|}
\end{equation}
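To make the formula concrete, the Jaccard similarity can be computed directly with base R set operations. The two toy word vectors below are invented purely for illustration:

```{r jacbase}
# two toy texts represented as word sets (illustrative example)
a <- c("the", "quick", "brown", "fox")
b <- c("the", "lazy", "brown", "dog")
# |A intersect B| / |A union B|
jaccard <- length(intersect(a, b)) / length(union(a, b))
jaccard
```

The two texts share two words ("the" and "brown") out of six distinct words overall, so the Jaccard similarity is 2/6, roughly 0.33.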
## Cosine Similarity{-}
In the case of cosine similarity, the two documents are represented as vectors in an n-dimensional vector space, with each dimension corresponding to a word. The cosine similarity metric then measures the cosine of the angle between these two vectors. For term-count vectors, the cosine similarity ranges from 0 to 1: a value closer to 0 indicates less similarity, whereas a score closer to 1 indicates more similarity. The mathematical representation of the Cosine Similarity is shown below:
\begin{equation}
similarity = cos(\theta) = \frac{A \cdot B}{||A|| ||B||} = \frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}}
\end{equation}
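The same calculation can be written out in base R: represent each document as a vector of term counts over a shared vocabulary and apply the formula directly. The counts below are made up for illustration:

```{r cosbase}
# term-count vectors for two toy documents over a shared vocabulary
# vocabulary: the, quick, brown, fox, lazy, dog
A <- c(2, 1, 1, 1, 0, 0)
B <- c(2, 0, 1, 0, 1, 1)
# dot product divided by the product of the vector norms
cos_sim <- sum(A * B) / (sqrt(sum(A^2)) * sqrt(sum(B^2)))
cos_sim
```

Here the dot product is 5 and both vectors have norm sqrt(7), giving a cosine similarity of 5/7, roughly 0.71.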
## Levenshtein Distance{-}
Levenshtein distance comparison is generally carried out between two words. It determines the minimum number of single-character edits required to change one word into the other: the higher the number of edits, the more different the two words are. An edit is either an insertion of a character, a deletion of a character, or a replacement of a character. For two words *a* and *b* with lengths *i* and *j*, the Levenshtein distance is defined as follows:
\begin{equation}
lev_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \quad \text{if } \min(i,j) = 0,\\
\min \begin{cases}
lev_{a,b}(i-1,j)+1 \\
lev_{a,b}(i, j-1)+1 \\
lev_{a,b}(i-1,j-1)+1_{(a_{i} \neq b_{j})}
\end{cases} & \quad \text{otherwise.}
\end{cases}
\end{equation}
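The recursion above is usually computed by filling a dynamic-programming table. The base R function below is a minimal sketch of that approach (written for this tutorial, not part of any package); base R's `adist()` implements the same generalised Levenshtein distance and can be used to cross-check the result:

```{r levbase}
# minimal dynamic-programming implementation of the Levenshtein distance
lev <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  d <- matrix(0L, nrow = length(a) + 1, ncol = length(b) + 1)
  d[, 1] <- 0:length(a)   # cost of deleting i characters
  d[1, ] <- 0:length(b)   # cost of inserting j characters
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      cost <- as.integer(a[i] != b[j])   # 0 if the characters match
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,   # deletion
                             d[i + 1, j] + 1,   # insertion
                             d[i, j] + cost)    # substitution
    }
  }
  d[length(a) + 1, length(b) + 1]
}
lev("Marta", "Martha")    # 1: insert an h
adist("Marta", "Martha")  # base R equivalent
```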
## Preparation and session set up{-}
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it [here](/tutorials/intror/intror.html). For this tutorial, we need to install certain *packages* from an R *library* so that the scripts shown below run without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed these packages, you can skip this section. To install the necessary packages, simply run the following code; it may take some time (between 1 and 5 minutes), so do not worry if it does not finish right away.
```{r prep1, echo=T, eval = F, message=FALSE, warning=FALSE}
# set options
options(stringsAsFactors = F)
# install libraries
install.packages("stringdist")
install.packages("hashr")
install.packages("tidyverse")
```
Now that we have installed the packages, we activate them as shown below.
```{r prep2, message=FALSE, warning=FALSE, class.source='klippy'}
# set options
options(stringsAsFactors = F) # no automatic data transformation
options("scipen" = 100, "digits" = 12) # suppress scientific notation
# activate packages
library(stringdist)
library(hashr)
library(tidyverse)
```
Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.
# Measuring Similarity in R{-}
To evaluate the similarity scores and edit distances for the methods discussed above, we have installed the *stringdist* package and will primarily be using two of its functions: *stringdist* and *seq_dist*. We also use the *hashr* package so that the Jaccard and cosine similarity are evaluated word-wise instead of letter-wise: each sentence is tokenised, and the resulting words are hashed so that the sentences are transformed into sequences of integers. For the Jaccard and the Cosine similarity we will be using the same pair of texts, whereas for the Levenshtein edit distance we will take 3 pairs of words to illustrate *insert*, *delete* and *replace* operations.
```{r librarydata, echo=T, eval = T, message=FALSE, warning=FALSE}
text1 <- "The quick brown fox jumped over the wall"
text2 <- "The fast brown fox leaped over the wall"
insert_ex <- c("Marta", "Martha")
del_ex <- c("Genome", "Gnome")
rep_ex <- c("Tim", "Tom")
```
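To see what the hashing step does, the short sketch below tokenises a sentence and hashes the tokens. The exact integer values are implementation-dependent, so only the structure of the output matters here:

```{r hashdemo}
library(hashr)
sent <- "The quick brown fox"
tokens <- strsplit(sent, "\\s+")  # tokenise on whitespace
hash(tokens)                      # each word becomes an integer
```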
## Jaccard Similarity{-}
```{r jac}
# seq_dist() with method = "jaccard" returns the Jaccard *distance*,
# so we subtract it from 1 to obtain the similarity; hash() turns the
# word tokens into integers so the comparison is word-wise
# (q = 2 compares pairs of adjacent words)
jac_sim_score <- 1 - seq_dist(hash(strsplit(text1, "\\s+")), hash(strsplit(text2, "\\s+")), method = "jaccard", q = 2)
print(paste0("The Jaccard similarity for the two texts is ", jac_sim_score))
```
## Cosine Similarity{-}
```{r cos}
# seq_dist() with method = "cosine" returns the cosine *distance*,
# so we subtract it from 1 to obtain the similarity
cos_sim_score <- 1 - seq_dist(hash(strsplit(text1, "\\s+")), hash(strsplit(text2, "\\s+")), method = "cosine", q = 2)
print(paste0("The Cosine similarity for the two texts is ", cos_sim_score))
```
## Levenshtein distance{-}
```{r le}
# Insert edit
ins_edit <- stringdist(insert_ex[1], insert_ex[2], method = "lv")
print(paste0("The insert edit distance for ", insert_ex[1], " and ", insert_ex[2], " is ", ins_edit))
# Delete edit
del_edit <- stringdist(del_ex[1], del_ex[2], method = "lv")
print(paste0("The delete edit distance for ", del_ex[1], " and ", del_ex[2], " is ", del_edit))
# Replace edit
rep_edit <- stringdist(rep_ex[1], rep_ex[2], method = "lv")
print(paste0("The replace edit distance for ", rep_ex[1], " and ", rep_ex[2], " is ", rep_edit))
```
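If a normalised score between 0 and 1 is preferred over a raw edit count, the *stringdist* package also provides *stringsim*, which converts the edit distance into a similarity (1 minus the distance divided by the length of the longer string):

```{r levsim}
library(stringdist)
stringsim("Marta", "Martha", method = "lv")  # 1 - 1/6, i.e. about 0.83
stringsim("Tim", "Tom", method = "lv")       # 1 - 1/3, i.e. about 0.67
```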
# Concluding remarks{-}
As shown above, the Jaccard and the Cosine similarity scores for the same pair of texts differ, which is important to keep in mind when choosing a measure of similarity. The differences arise primarily because the Jaccard similarity only takes the unique words in the two texts into consideration, whereas the Cosine similarity approach also takes word frequencies, i.e. the lengths of the vectors, into consideration. For the Levenshtein edit distance, the examples above show that in the first case we have to insert an extra *h*, in the second we have to delete an *e*, and in the last case we need to replace *i* with *o*. Thus, for all the pairs considered here the edit distance is 1.
# Citation & Session Info {-}
::: {.callout-note}
## Citation
```{r citation-callout, echo=FALSE, results='asis'}
cat(
params$author, ". ",
params$year, ". *",
params$title, "*. ",
params$institution, ". ",
"url: ", params$url, " ",
"(Version ", params$version, "), ",
"doi: ", params$doi, ".",
sep = ""
)
```
```{r citation-bibtex, echo=FALSE, results='asis'}
key <- paste0(
tolower(gsub(" ", "", gsub(",.*", "", params$author))),
params$year,
tolower(gsub("[^a-zA-Z]", "", strsplit(params$title, " ")[[1]][1]))
)
cat("```\n")
cat("@manual{", key, ",\n", sep = "")
cat(" author = {", params$author, "},\n", sep = "")
cat(" title = {", params$title, "},\n", sep = "")
cat(" year = {", params$year, "},\n", sep = "")
cat(" note = {", params$url, "},\n", sep = "")
cat(" organization = {", params$institution, "},\n", sep = "")
cat(" edition = {", params$version, "},\n", sep = "")
cat(" doi = {", params$doi, "}\n", sep = "")
cat("}\n```\n")
```
:::
```{r fin}
sessionInfo()
```
::: {.callout-note}
## AI Transparency Statement
This tutorial was re-developed with the assistance of **Claude** (claude.ai), a large language model created by Anthropic. Claude was used to help revise the tutorial text, structure the instructional content, generate the R code examples, and write the `checkdown` quiz questions and feedback strings. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy and pedagogical appropriateness of the material. The use of AI assistance is disclosed here in the interest of transparency and in accordance with emerging best practices for AI-assisted academic content creation.
:::
[Back to top](#introduction)
[Back to HOME](/index.html)
# References{-}